R Bootcamp, Module 8: Graphics

August 2019, UC Berkeley

Dana Seidel (built off of material by Kellie Ottoboni and Chris Krogslund)

By way of introduction…

And here’s some motivation - we can produce a plot like this with a few lines of code.

(Compare to the famous gapminder plot.)

Base graphics

The general call for base plot looks something like this:

plot(x = , y = , ...)

Additional parameters (for example type, main, xlab, and ylab) can be passed in to customize the plot.

More layers can be added to the plot with additional calls to lines, points, text, etc.

gapChina <- gap %>% filter(country == "China")
plot(gapChina$year, gapChina$gdpPercap)
plot(gapChina$year, gapChina$gdpPercap, type = "l",
     main = "China GDP over time",
     xlab = "Year", ylab = "GDP per capita") # with updated parameters
points(gapChina$year, gapChina$gdpPercap, pch = 16)
points(x = 1977, y = gapChina$gdpPercap[gapChina$year == 1977],
       col = "red", pch = 16)

Other plot types in base graphics

There are a variety of other types of plots you can make in base graphics.

boxplot(lifeExp ~ year, data = gap)
hist(gap$lifeExp[gap$year == 2007])
plot(density(gap$lifeExp[gap$year == 2007]))
barplot(gapChina$pop, width = 4, names.arg = gapChina$year, 
                               main = "China population")

Object-oriented plots

gap_lm <- lm(lifeExp ~ log(gdpPercap) + year, data = gap)

# Calls plotting method for class of the dataset ("data.frame")
plot(gap[,c('pop','lifeExp','gdpPercap')])

# Calls plotting method for class of gap_lm object ("lm"), print first two plots only
plot(gap_lm, which=1:2)
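
plot() is a generic function: it looks at the class of its first argument and calls the matching method (plot.data.frame, plot.lm, and so on). As a minimal sketch of that mechanism, the following defines a plot method for a made-up class. The class name stopping_data and the object stopping are hypothetical and exist only for illustration; the data come from the built-in cars dataset.

```r
# A toy object: the built-in `cars` data tagged with a hypothetical class
stopping <- structure(list(speed = cars$speed, dist = cars$dist),
                      class = "stopping_data")

# plot() dispatches to this method for objects of class "stopping_data"
plot.stopping_data <- function(x, ...) {
  plot(x$speed, x$dist,
       xlab = "Speed (mph)", ylab = "Stopping distance (ft)", ...)
  invisible(x)
}

# Same generic call as before, but a different method does the drawing
plot(stopping, main = "Dispatched to plot.stopping_data")

# The class of an object determines which method is used
class(lm(dist ~ speed, data = cars))
```

Running methods(plot) lists the plot methods currently available in your session.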

Pros/cons of base graphics, ggplot2, and lattice

Base graphics is

  1. good for exploratory data analysis and sanity checks

  2. inconsistent in syntax across functions: some take x,y while others take formulas

  3. unattractive by default, and its plotting parameters can be difficult to customize

  4. flexible: that said, one can do essentially anything in base graphics with some work

ggplot2 is

  1. generally more elegant

  2. more syntactically logical (and therefore simpler, once you learn it)

  3. better at grouping

  4. able to interface with maps

lattice is

  1. faster than ggplot2 (though only noticeable over many and large plots)

  2. simpler than ggplot2 (at first)

  3. better at trellis graphs than ggplot2

  4. able to do 3d graphs

We’ll focus on ggplot2 as it is very powerful, very widely-used and allows one to produce very nice-looking graphics without a lot of coding.

Basic usage: ggplot2

The general call for ggplot2 graphics looks something like this:

# NOT run
ggplot(data = , aes(x = ,y = , [options])) + geom_xxxx() + ... + ... + ...

Note that ggplot2 builds graphs in layers within one continuing call (hence the endless +…+…+…); each extra layer is part of that call.

... + geom_xxxx(data = , aes(x = , y = ,[options]), [options]) + ... + ... + ...

You can see the layering effect by comparing the same graph with different colors for each layer:

p <- ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
                 geom_point(color = "red")
p
p + geom_point(aes(x = year, y = lifeExp), color = "gray") + ylab("life expectancy") +
    theme_minimal()
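
To make the layering more concrete, here is a small sketch that builds a plot one layer at a time and inspects each layer with layer_data(). It uses the built-in cars dataset rather than gapChina so that it runs on its own; the object names base, p1, and p2 are just for this example.

```r
library(ggplot2)

# Base plot object: data plus aesthetic mapping, no layers yet
base <- ggplot(cars, aes(x = speed, y = dist))

# Each `+` returns a new plot object with one more layer
p1 <- base + geom_point(color = "red", size = 3)
p2 <- p1 + geom_line(color = "gray")

# layer_data() returns the data ggplot2 computed for a given layer
head(layer_data(p2, 1))   # first layer: the red points
head(layer_data(p2, 2))   # second layer: the gray line, drawn on top

p2
```

Layers are drawn in the order they are added, so later layers sit on top of earlier ones.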

And, if you’re desperate for the quick and dirty functionality of base plot, or just like the more familiar syntax at first, ggplot2 offers the qplot() function as a wrapper for most basic plots. Note that qplot() is deprecated as of ggplot2 3.4.0, so prefer ggplot() in new code:

qplot(x = year, y = lifeExp, data = gapChina)
qplot(x = year, y = lifeExp, data = gapChina, geom = "line")

Grammar of Graphics

ggplot2 syntax is very different from base graphics and lattice. It’s built on the grammar of graphics. The basic idea is that the visualization of all data requires four items:

  1. One or more statistics conveying information about the data (identities, means, medians, etc.)

  2. A coordinate system that differentiates between the intersections of statistics (at most two for ggplot, three for lattice)

  3. Geometries that differentiate between off-coordinate variation in kind

  4. Scales that differentiate between off-coordinate variation in degree

ggplot2 allows the user to manipulate all four of these items through the stat_*, coord_*, geom_*, and scale_* functions.

All of these are important to truly becoming a ggplot2 master 🧙‍♂️ 😉, but today we are going to focus on the most important to basic users and their data layers: ggplot2’s geometries
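
As a rough illustration of the four pieces working together, the sketch below uses one function from each family on the built-in mtcars dataset (not the gapminder data). The particular choices (group means, flipped axes, the y-axis limits) are arbitrary and only meant to show where each family fits; the fun argument of stat_summary() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(color = "gray50") +                      # geom_*: raw points
  stat_summary(fun = mean, geom = "point",            # stat_*: group means
               color = "red", size = 4) +
  scale_y_continuous(limits = c(10, 35)) +            # scale_*: y-axis range
  coord_flip()                                        # coord_*: swap the axes
p
```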

Some Examples

# Scatterplot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_point() +
                          ggtitle("China's life expectancy")
# Line (time series) plot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_line() +
                          ggtitle("China's life expectancy")
# Boxplot
ggplot(gap, aes(x = factor(year), y = lifeExp)) + geom_boxplot() +
                          ggtitle("World's life expectancy")
# Histogram
gap2007 <- gap %>% filter(year == 2007)
ggplot(gap2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
                          ggtitle("World's life expectancy")

ggplot2 and tidy data

# This combines the subsetting and plotting into one step
gap %>% filter(year == 2007) %>% 
        ggplot(aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
                          ggtitle("World's life expectancy")

For example, here ggplot treats country as an aesthetic parameter that differentiates groups of values, whereas base graphics treats each (year, country column) pair as a separate set of inputs to the plot.

Here’s ggplot with the data in a tidy format.

# ggplot2 call
head(gap)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
ggplot(data = gap, aes(x = year, y = lifeExp)) +
            geom_line(aes(color = country), show.legend = FALSE)

Is that a useful plot?

And here’s use of base graphics, taking advantage of non-tidy, wide-formatted data.

# Base graphics call
gap_wide <- gap %>% select(country, year, lifeExp) %>% spread(country, lifeExp)
gap_wide[1:5, 1:5]
##   year Afghanistan Albania Algeria Angola
## 1 1952      28.801   55.23  43.077 30.015
## 2 1957      30.332   59.28  45.685 31.999
## 3 1962      31.997   64.82  48.303 34.000
## 4 1967      34.020   66.22  51.407 35.985
## 5 1972      36.088   67.69  54.518 37.928
plot(gap_wide$year, gap_wide$China, col = 'red', type = 'l', ylim = c(40, 85))
lines(gap_wide$year, gap_wide$Turkey, col = 'green')
lines(gap_wide$year, gap_wide$Italy, col = 'blue')
legend("right", legend = c("China", "Turkey", "Italy"),
                fill = c("red", "green", "blue")) # same order as the lines drawn above

Of course, as mentioned above, you can always filter your tidy data to replicate this plot with ggplot2:

gap %>%
  filter(country %in% c("China", "Turkey", "Italy")) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country))
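
If your data arrive in wide format, you can reshape them into the tidy layout that ggplot2 expects. The following is a minimal sketch using pivot_longer() from tidyr (version 1.0.0 or later); the small wide data frame just retypes a few values from the gap_wide printout above so the example is self-contained.

```r
library(tidyr)

# First three rows and two countries from the gap_wide printout above
wide <- data.frame(
  year        = c(1952, 1957, 1962),
  Afghanistan = c(28.801, 30.332, 31.997),
  Albania     = c(55.23, 59.28, 64.82)
)

# One row per (year, country) observation: the tidy layout ggplot2 expects
long <- pivot_longer(wide, cols = -year,
                     names_to = "country", values_to = "lifeExp")
long
```

The resulting long data frame can be passed to ggplot() with aes(color = country), just like gap.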

Pros/cons of ggplot2

An overview of syntax for various ggplot2 geoms

We’ve already seen these initial ones.

X-Y scatter plots: geom_point()

ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_point() +
                          ggtitle("China's life expectancy")

X-Y line plots: geom_line() or geom_path()

ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_line() +
                          ggtitle("China's life expectancy")

Histograms and bar charts: geom_histogram(), geom_bar(), or geom_col()

gap2007 <- gap %>% filter(year == 2007)
ggplot(gap2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
                          ggtitle("World's life expectancy")

Densities: geom_density(), geom_density2d()

ggplot(gap2007, aes(x = lifeExp)) + geom_density() + 
                          ggtitle("World's life expectancy")

Boxplots: geom_boxplot()

# Notice that here, you must explicitly convert numeric years to factors
ggplot(data = gap, aes(x = factor(year), y = lifeExp)) +
            geom_boxplot() 

“Trellis” plots: facet_grid() or facet_wrap()

ggplot(data = gap, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
            facet_wrap(~year)
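
The example above uses facet_wrap(); as a sketch of how facet_grid() differs, the following compares the two on the built-in mtcars dataset (chosen only because it has two convenient grouping variables, cyl and am).

```r
library(ggplot2)

# facet_wrap(): panels for one variable, wrapped into rows and columns
p_wrap <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~cyl, ncol = 3)

# facet_grid(): rows defined by one variable, columns by another
p_grid <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5) +
  facet_grid(am ~ cyl)

p_wrap
p_grid
```

facet_wrap() is usually the better fit for a single grouping variable with many levels (such as year), while facet_grid() suits a cross of two variables.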

Contour plots: geom_contour()

data(volcano) # Load volcano contour data
volcano[1:10, 1:10] # Examine volcano dataset (first 10 rows and columns)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]  100  100  101  101  101  101  101  100  100   100
##  [2,]  101  101  102  102  102  102  102  101  101   101
##  [3,]  102  102  103  103  103  103  103  102  102   102
##  [4,]  103  103  104  104  104  104  104  103  103   103
##  [5,]  104  104  105  105  105  105  105  104  104   103
##  [6,]  105  105  105  106  106  106  106  105  105   104
##  [7,]  105  106  106  107  107  107  107  106  106   105
##  [8,]  106  107  107  108  108  108  108  107  107   106
##  [9,]  107  108  108  109  109  109  109  108  108   107
## [10,]  108  109  109  110  110  110  110  109  109   108
volcano3d <- melt(volcano) # Use reshape2 package to melt the data into tidy form
head(volcano3d) # Examine volcano3d dataset (head)
##   Var1 Var2 value
## 1    1    1   100
## 2    2    1   101
## 3    3    1   102
## 4    4    1   103
## 5    5    1   104
## 6    6    1   105
names(volcano3d) <- c("xvar", "yvar", "zvar") # Rename volcano3d columns

ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) +
            geom_contour() 

tile/image/level plots, heatmaps: geom_tile(), geom_rect(), geom_raster()

ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) +
            geom_tile(aes(fill = zvar)) 

Fitted lines and curves with ggplot2

ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10()

# Add linear model (lm) smoother
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "lm")

# Add local regression (loess) smoother, span of 0.75 (more smoothed)
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "loess", span = .75)

# Add local regression (loess) smoother, span of 0.25 (less smoothed)
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "loess", span = .25)

# Add linear model (lm) smoother, no standard error shading
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "lm", se = FALSE)

# Add local regression (loess) smoother, no standard error shading
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "loess", se = FALSE)
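
One detail worth knowing about the plots above: scale transformations such as scale_x_log10() are applied before the smoother is fitted, so the "lm" line is a regression on log10(x), not on x. The sketch below checks this on the built-in mtcars dataset (standing in for gap2007 so that it runs on its own); the names smooth, fit, and manual are only for this example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The fitted line that geom_smooth() draws (x is on the log10 scale here)
smooth <- layer_data(p, 2)

# The same line from an explicit regression of mpg on log10(hp)
fit <- lm(mpg ~ log10(hp), data = mtcars)
manual <- predict(fit, newdata = data.frame(hp = 10^smooth$x))

# Compare the two sets of fitted values
all.equal(smooth$y, unname(manual))
```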

Anatomy of aes()

# NOT run
ggplot(data = , aes(x = , y = , color = , linetype = , shape = , size = ))

These four aesthetic parameters (color, linetype, shape, size) can be used to show variation in kind (categories) and variation in degree (numeric).

Parameters passed into aes should be variables in your dataset.

Parameters passed to geom_xxx outside of aes should not be related to your dataset – they are fixed values that apply to everything drawn by that layer.

ggplot(data = gap, aes(x = year, y = lifeExp)) +
            geom_line(aes(color = country), show.legend = FALSE)

Note what happens when we specify the color parameter outside of the aesthetic operator. ggplot2 views these specifications as invalid graphical parameters.

ggplot(data = gap, aes(x = year, y = lifeExp)) +
            geom_line(color = country)
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomLine, : object 'country' not found
ggplot(data = gap, aes(x = year, y = lifeExp)) +
            geom_line(color = "country")
## Error in grDevices::col2rgb(colour, TRUE): invalid color name 'country'
## this works but only makes sense if we restrict to one country
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(color = "red")

Note: Aesthetics automatically show up in your legend, parameters (those not mapped to a variable in your data frame) do not!
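
A common slip is the reverse case: putting a constant inside aes(). The sketch below, using the built-in cars dataset, contrasts setting a color with (accidentally) mapping one; p_set and p_map are names made up for this example.

```r
library(ggplot2)

# Setting: a constant outside aes() is used literally, and no legend is drawn
p_set <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "red")

# Mapping: inside aes(), "red" is treated as a data value (a one-level
# category), so ggplot2 picks a color from its palette and adds a legend
p_map <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(aes(color = "red"))

# Inspect the colors actually used in each plot
unique(layer_data(p_set, 1)$colour)
unique(layer_data(p_map, 1)$colour)
```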

Using aesthetics to highlight features

Differences in kind

## color as the aesthetic to differentiate by continent
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) + scale_x_log10()

## point shape as the aesthetic to differentiate by continent
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(shape = continent)) + scale_x_log10()

## line type as the aesthetic to differentiate by country
gapOceania <- gap %>% filter(continent %in% 'Oceania')
ggplot(data = gapOceania, aes(x = year, y = lifeExp)) +
            geom_line(aes(linetype = country))

Differences in degree

## point size as the aesthetic to differentiate by population
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(size = pop)) + scale_x_log10()

## color as the aesthetic to differentiate by population
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = pop)) + scale_x_log10() +
            scale_color_gradient(low = 'lightgray', high = 'black')

Multiple non-coordinate aesthetics (differences in kind using color, degree using point size)

ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(size = pop, color = continent)) + scale_x_log10()

Scaling Aesthetics:

Aesthetics are handled by their very own scale functions, which allow you to set the limits, breaks, transformations, and any palettes that determine how your data are plotted. ggplot2 includes a number of helpful scale functions, like scale_x_log10(), which can transform your data on the fly, or scale_color_viridis_d() and scale_color_viridis_c(), which use palettes from the viridis package specifically designed to “make plots that are pretty, better represent your data, easier to read by those with colorblindness, and print well in grey scale.”

For example, our data might be better represented using a log10 transformation of per capita GDP:

ggplot(gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) +
  scale_x_log10()

And perhaps we want colors that are a little different:

ggplot(gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) +
  scale_x_log10() +
  scale_color_viridis_d()

Or perhaps we want to set our palettes and breaks or labels manually:

ggplot(gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) +
  scale_x_log10(labels = scales::dollar) +
  scale_color_manual("Continent", 
                     values = c("red", "blue", "green", "yellow", "#800080")) # hex codes work!

For more info about setting scales in ggplot2 and for more helper functions consider diving into the scales package which is the backend to much of the scales functionality in ggplot2
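
As a small taste of the scales helpers, here is a sketch that calls two label functions directly and then plugs them into a plot. It uses the diamonds dataset that ships with ggplot2 (prices in US dollars) instead of the gapminder data so that it is self-contained; the break values are arbitrary choices for illustration.

```r
library(ggplot2)

# Label helpers from scales are plain functions, so they can be tried directly
scales::dollar(c(1000, 25000))
scales::comma(1234567)

# The same helpers plug into any scale via `labels`; `breaks` sets tick positions
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_y_log10(breaks = c(500, 2000, 8000), labels = scales::dollar) +
  scale_x_continuous(breaks = seq(0, 5, by = 1))
p
```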

Fine tuning your plot

ggplot handles many plot options as additional layers.

Labels

ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
  xlab(label = "GDP per capita") +
  ylab(label = "Life expectancy") +
  ggtitle(label = "Gapminder") 

Or even more simply use the labs() function

ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
  labs(x = "GDP per capita", y = "Life expectancy", title = "Gapminder")

Axis and point scales

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point() 
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(size=3) 
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(size=1) 

Colors

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(color = colors()[11]) 
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(color = "red") 

Point Styles and Widths

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(shape = 3) 
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(shape = "w") 
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(shape = "$", size=5) 

Line Styles and Widths

ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(linetype = 1) 
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(linetype = 2) 
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(linetype = 5, size = 2) # in ggplot2 >= 3.4.0, use linewidth = 2 for lines

Themes with ggplot2

Elements of the plot not associated with geometries can be adjusted using ggplot themes.

There are some “complete” themes already included with the package: theme_gray() (the default), theme_minimal(), theme_bw(), theme_light(), theme_dark(), and theme_classic().

In addition to these, you can tweak just about any element of your plot’s appearance using the theme() function.

For instance, perhaps you want to move the legend from the right (the default) to the bottom of your plot; this would be part of the plot theme. Note how you can add options to a complete theme already in the plot:

gap %>%
  filter(country %in% c("China", "Turkey", "Italy")) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country)) +
  theme_minimal() + 
  theme(legend.position = "bottom")
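
Beyond the legend position, theme() takes element_*() helpers for individual pieces of the plot. The following sketch shows a few such tweaks on the built-in mtcars dataset; which elements to change is purely a matter of taste, and the specific settings here are arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "Cylinders", title = "Theme tweaks") +
  theme_bw() +                                     # start from a complete theme
  theme(legend.position = "bottom",                # then override single elements
        panel.grid.minor = element_blank(),        # drop the minor grid lines
        axis.title = element_text(face = "bold"),  # bold axis titles
        plot.title = element_text(hjust = 0.5))    # center the title
p
```

Order matters: a complete theme such as theme_bw() added after theme() would overwrite these tweaks.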

Combining Multiple Plots

# Initialize gridExtra library
library(gridExtra)

# Create 3 plots to combine in a table
plot1 <- ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + annotate('text', 150, 80, label = '(a)')
plot2 <- ggplot(data = gap2007, aes(x = pop, y = lifeExp)) +
  geom_point() + scale_x_log10() + annotate('text', 1.8e5, 80, label = '(b)')
plot3 <- ggplot(data = gap, aes(x = year, y = lifeExp)) +
      geom_line(aes(color = country), show.legend = FALSE) +
      annotate('text', 1951, 80, label = '(c)')


# Call grid.arrange
grid.arrange(plot1, plot2, plot3, nrow=3, ncol = 1)

patchwork: Combining Multiple ggplot2 plots

# Install and initialize patchwork library
# install.packages("patchwork")  # patchwork is now available on CRAN
# devtools::install_github("thomasp85/patchwork")  # development version
library(patchwork)

# use the patchwork operators
# place plots side by side (with more plots, + wraps them into a grid)
plot1 + plot2 + plot3

# stack plots vertically
plot1 / plot2 / plot3

# side-by-side plots with third plot below
(plot1 | plot2) / plot3

# side-by-side plots with a space in between, and a third plot below
(plot1 | plot_spacer() | plot2) / plot3

# stack plots vertically and alter with a single "gg_theme"
(plot1 / plot2 / plot3) & theme_bw()
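
patchwork can also control relative panel sizes and add panel tags for you, which avoids placing '(a)', '(b)', '(c)' by hand with annotate(). A minimal sketch, assuming patchwork is installed; the plots a, b, and c3 are throwaway examples built from the built-in mtcars dataset.

```r
library(ggplot2)
library(patchwork)

a  <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
b  <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
c3 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

combined <- (a | b) / c3 +
  plot_layout(heights = c(1, 2)) +           # bottom row twice as tall as the top
  plot_annotation(title = "Three views of mtcars",
                  tag_levels = "a",          # tag panels a, b, c ...
                  tag_prefix = "(", tag_suffix = ")")
combined
```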

Feel free to explore more at https://github.com/thomasp85/patchwork.

Note: patchwork is an example of a ggplot2 extension package, of which there are many! One of the benefits of learning and using ggplot2 is that there is a huge community of developers who build separate graphics packages that generally use the same syntax to extend ggplot2 functionality into things like animation and 3D plotting! Check them out at http://www.ggplot2-exts.org/gallery/

Exporting

Two basic image types:

Raster/Bitmap (.png, .jpeg)

Every pixel of a plot contains its own separate coding; not so great if you want to resize the image

# NOT run
jpeg(filename = "example.jpg", width=, height=)
plot(x,y)
dev.off()

Vector (.pdf, .ps)

Every element of a plot is encoded with a function that gives its coding conditional on several factors; great for resizing

# NOT run
pdf(file = "example.pdf", width=, height=)
plot(x,y)
dev.off()
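
For a version you can actually run, here is a sketch with concrete values that writes to a temporary directory. The file name, the 6 x 4 size, and the use of the built-in cars dataset are arbitrary choices for illustration. Note that pdf() takes width and height in inches, while png() and jpeg() default to pixels.

```r
# Write to a temporary location so the example leaves no files behind
out <- file.path(tempdir(), "example.pdf")

pdf(file = out, width = 6, height = 4)   # open the device; size in inches
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping distance")
dev.off()                                # closing the device finalizes the file

file.exists(out)
```

Forgetting dev.off() is a common source of empty or unreadable output files.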

Exporting with ggplot

# NOT run

# Assume we saved our plot as an object called `plot1`.

ggsave(filename = "example.pdf", plot = plot1, scale = , width = ,
       height = )
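
A runnable sketch of the same call with the blanks filled in. The plot p, the output path, and the dimensions are placeholders chosen for this example; ggsave() picks the output device from the file extension.

```r
library(ggplot2)

p <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Save to a temporary location; the ".pdf" extension selects the PDF device
out <- file.path(tempdir(), "example_ggsave.pdf")
ggsave(filename = out, plot = p, width = 6, height = 4, units = "in")

file.exists(out)
```

For raster formats such as .png, the dpi argument of ggsave() controls the resolution.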

Breakout

These questions ask you to work with the gapminder dataset.

Basics

  1. Plot a histogram of life expectancy.

  2. Plot the gdp per capita against population. Put the x-axis on the log scale.

  3. Clean up your scatterplot with a title and axis labels. Output it as a PDF and see if you’d be comfortable with including it in a report/paper.

Using the ideas

  1. Create a trellis plot of life expectancy by gdpPercap scatterplots, one subplot per continent. Use a 2x3 layout of panels in the plot. Now have the size of the points vary with population. Use scale_x_continuous() to set the x-axis limits to be in the range from 100 to 50000.

  2. Make a boxplot of life expectancy conditional on binned values of gdp per capita.

Advanced

  1. Using the data for 2007, recreate as much as you can of this famous Gapminder plot, where the colors are different continents. (Don’t worry about the ‘2015’ in the background and ignore the ‘play’ button at the bottom.)

  2. Create a “trellis” plot where, for a given year, each panel uses a) hollow circles to plot lifeExp as a function of log(gdpPercap), and b) a red loess smoother without standard errors to plot the trend. Turn off the grey background. Figure out how to use partially-transparent points to reduce the effect of the overplotting of points.